Clustering Unstructured Data (Flat Files) - An Implementation in Text Mining Tool
Abstract
As technology advances and storage costs fall, individuals and organizations increasingly store textual information and documents electronically. Retrieving relevant information from an unstructured document collection is time-consuming for readers. Finding documents in a large collection is easier and faster when the collection is ordered or classified by group or category, but finding the best such grouping remains an open problem. This paper discusses our implementation of the k-Means clustering algorithm for unstructured text documents, beginning with the representation of unstructured text and ending with the resulting set of clusters. Based on an analysis of the clusters obtained for a sample set of documents, we also propose a document representation technique that can further improve the clustering result.
Keywords: Information Extraction (IE); Clustering; k-Means Algorithm; Document Classification; Bag-of-words; Document Matching; Document Ranking; Text Mining
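The pipeline the abstract outlines (represent each document as a bag of words, then group the vectors with k-Means) can be illustrated with a minimal sketch. Everything specific here is an assumption rather than the paper's method: the tokenizer, the unit-length normalization, the Euclidean distance, the choice of seed documents, and all function names (`tokenize`, `bag_of_words`, `normalize`, `kmeans`) are hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens (assumed tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def bag_of_words(docs):
    """Return (vocabulary, vectors) with one raw term-count vector per document."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    return vocab, [[float(c[t]) for t in vocab] for c in counts]


def normalize(vec):
    """Scale a vector to unit length so that document length does not dominate."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(vectors, seeds, max_iter=100):
    """Lloyd's k-Means; `seeds` are indices of documents used as initial centroids."""
    centroids = [list(vectors[i]) for i in seeds]
    labels = None
    for _ in range(max_iter):
        # Assignment step: attach each document to its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda j: distance(v, centroids[j]))
            for v in vectors
        ]
        if new_labels == labels:  # assignments stopped changing: converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Illustrative toy documents, not data from the paper.
docs = [
    "the cat sat on the mat",
    "a cat and a kitten played",
    "stock markets fell sharply today",
    "investors sold stock as markets fell",
]
vocab, raw_vectors = bag_of_words(docs)
vectors = [normalize(v) for v in raw_vectors]
labels = kmeans(vectors, seeds=[0, 2])
print(labels)  # [0, 0, 1, 1]
```

Seeding with fixed document indices keeps the sketch deterministic. A practical implementation would more likely use random or k-means++ initialization and a weighting scheme such as TF-IDF, neither of which the abstract confirms.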
Similar resources
Enhancing Computer Inspection Using Document Clustering for Analysis
In document analysis, computers holding huge numbers of data files make analysis disorderly, and most of the data in those files is unstructured and therefore difficult to analyze. We therefore present an approach that reduces the analysis effort by clustering the documents. Clustering is a division of data into groups of similar objects. The clustering techniques used in o...
Document Clustering Based on Ontology and a Fuzzy Approach
Data mining, also known as knowledge discovery in databases, is the process of discovering unknown knowledge from a large amount of data. Text mining applies data mining techniques to extract knowledge from unstructured text. Text clustering is one of the important techniques of text mining: the unsupervised classification of similar documents into different groups. The most important step...
In-depth Interactive Visual Exploration for Bridging Unstructured and Structured Document Content
Semi-structured data refers to the combination of unstructured and structured data. Unstructured data is free text in natural language, while structured data is typically stored in tables and follows a data schema. Recent statistics show that 80% of the data generated in the last two years is unstructured. However, one interesting observation is that free text usually comes along with some s...
Akshaya: A Framework for Mining General Knowledge Semantics From Unstructured Text
We report a tool called Akshaya, which implements a framework to mine four types of “general knowledge semantics” (analytical semantics) from unstructured text. The semantics being mined are semantic siblings, topical anchors, topic expansion and topical markers. The framework provides options to embed more such general knowledge semantic mining algorithms into it. We use a term co-occurrence g...
Data Mining from Document-append Nosql
Due to the unstructured nature of modern digital data, NoSQL storages have been adopted by some enterprises as the preferred storage facility. NoSQL storages can store schema-oriented, semi-structured, schema-less data. A type of NoSQL storage is the document-append storage (e.g., CouchDB and Mongo) which has received high adoption due to its flexibility to store JSON-based data and files as at...
Journal title:
- CoRR
Volume abs/1007.4324 Issue
Pages -
Publication date 2010